Learning epistatic polygenic phenotypes with Boolean interactions

Detecting epistatic drivers of human phenotypes is a considerable challenge. Traditional approaches use regression to sequentially test multiplicative interaction terms involving pairs of genetic variants. For higher-order interactions and genome-wide large-scale data, this strategy is computationally intractable. Moreover, multiplicative terms used in regression modeling may not capture the form of biological interactions. Building on the Predictability, Computability, Stability (PCS) framework, we introduce the epiTree pipeline to extract higher-order interactions from genomic data using tree-based models. The epiTree pipeline first selects a set of variants derived from tissue-specific estimates of gene expression. Next, it uses iterative random forests (iRF) to search training data for candidate Boolean interactions (pairwise and higher-order). We derive significance tests for interactions, based on a stabilized likelihood ratio test, by simulating Boolean tree-structured null (no epistasis) and alternative (epistasis) distributions on hold-out test data. Finally, our pipeline computes PCS epistasis p-values that probabilisticly quantify improvement in prediction accuracy via bootstrap sampling on the test set. We validate the epiTree pipeline in two case studies using data from the UK Biobank: predicting red hair and multiple sclerosis (MS). In the case of predicting red hair, epiTree recovers known epistatic interactions surrounding MC1R and novel interactions, representing non-linearities not captured by logistic regression models. In the case of predicting MS, a more complex phenotype than red hair, epiTree rankings prioritize novel interactions surrounding HLA-DRB1, a variant previously associated with MS in several populations. Taken together, these results highlight the potential for epiTree rankings to help reduce the design space for follow up experiments.


S1.1 epiTree test based on PCS epistasis inference and PCS p-value
Traditional methods of statistical hypothesis testing evaluate uncertainty relative to a null distribution that describes the hypothesized data generating process.Intuitively, these methods address questions of the form: how likely is the observed test statistic if data were drawn from the specified null distribution.
A statistic (and hence data) that is not likely under the null provides evidence against the corresponding null model.In practice, null distributions are often used without justification as to why they are a "reasonable" baseline for such comparisons.This can lead to small p-values that provide strong evidence against a "straw-man" null hypothesis -i.e. a null hypothesis that is a priori known or widely believed to be a poor description of the data generating process.
To address this issue, the PCS inference framework [34] introduced a prediction screening step that evaluates model accuracy on hold-out test set.This step allows one to filter out "straw-man" null hypotheses and focus inference on those that provide a reasonable fit to the data.In order to evaluate the accuracy of red hair phenotype predictions, we split data into training and hold-out test sets, learn models corresponding to epistasis and non-epistasis on the training set, and carry out statistical hypothesis testing on the hold-out test set.By splitting our data in this way, we introduce an additional layer of uncertainty -inference is defined relative to one of many possible sample splits.We address this extra layer of uncertainty through the PCS inference framework.Specifically, we build on general PCS inference ideas to assess the strength of an epistatic effect.
Intuitively, our inference framework captures both the uncertainty quantified by traditional statistical inference and the uncertainty surrounding data perturbations (sample split).In other words, the PCS p-values we propose consider not only distributional assumptions of a null model -as in traditional statistical inference -but also uncertainty surrounding the range of sample splits that could have been considered but were not.By "inflating" p-values to account for a wider range of uncertainty, our approach helps ensure that evidence for a given interaction is more robust to a broader range of modeling decisions.
We call our approach epiTree test, because it is for epistasis discovery and uses decision-trees for both epistasis and no-epistasis models.

S1.1.1 Inference setting
For simplicity, we assume that a candidate interaction consists of two features: A and B. We provide details for the analogous case of general higher-order interactions in Section S1.1.9.Features A and B could be either continuous-valued gene expression features or discrete-valued SNP features.In the following, we focus on the gene expression setting.The case of SNP features is similar and described in Section S1. 1.10.Recall that in our epiTree pipeline, candidate interactions are obtained by running iRF on the training data.The PCS epistasis inference for these candidate interactions, as outlined below, uses the same training data to obtain individual (for each interaction), tree-based null (no-epistasis) and alternative (epistasis) models.On the separate hold-out test data, our aim is to evaluate evidence (i.e. a p-value) for the null hypothesis relative to the alternative hypothesis using test data consisting of n sample points Y = (y 1 , . . ., y n ) ⊤ , A = (a 1 , . . ., a n ) ⊤ , B = (b 1 , . . ., b n ) ⊤ .
The individual response values y i and features a i , b i may be continuous valued or discrete.Here, we consider the binary phenotype of red hair, y i ∈ {0, 1}, where y i = 1 indicates red hair, and continuousvalued gene expression features a i , b i ∈ R.

S1.1.2 Recap of standard logistic regression
The classical translation of Fisher's original definition of epistasis for binary phenotypes seeks evidence against null H 0 relative to alternative hypothesis H 1 under a simple logistic regression model.That is, the data are assumed to have come from a logistic regression model, which is a strong assumption that often leads to unrealistic uncertainty assessments in practice, especially for large samples of data.
Models for the null and alternative can be written as, where β j , j ∈ {A, B, AB, 0} represent coefficients describing the contribution of A, B, the interaction term AB, and an intercept respectively.In some cases, correction for population structure is necessary.Employing standard chi-square approximations for the likelihood ratio test of H 0 vs. H 1 , it is straightforward to obtain a p-value for the inference problem in (3).This (or some slight variation) is the standard p-value typically reported as a measure of an interaction's significance, e.g., with p-value < 0.05/number of interactions tested, reported as significant (with Bonferroni correction).

S1.1.3 Scaling of the response variable
The logit scaling in (3) has a considerable influence on the way an interaction is defined.Clearly, a relation which is multiplicative (non-linear, hence epistatic) on one scale, can be additive (non-epistatic) in another.For instance, a log transform of a multiplicative interaction becomes additive.Hence, Fisher's commonly accepted definition of epistasis critically depends on the selected scaling.In fact, the situation is even more extreme.It is demonstrated in [11] that any multivariate, real valued function with compact support can be written as an additive function for some appropriate scaling.In other words, without fixing a response scale, every function can be rewritten in an additive form.Thus, Fisher's definition of epistasis is only well-defined relative to a selected scale.
Out of statistical convenience, logistic regression models still dominate standard analyses.However, there is no biological justification for the logistic scaling.It has even been reported in empirical studies that this scaling does not always reflect biological function [24,10].
Acknowledging that any epistatic results is only well-defined with respect to a specific scaling, it is natural (see e.g., discussion and references in [6]) to select a canonical scaling for this purpose: the penetrance P (y = 1 | a, b) itself (instead of a logit scaling that transforms the penetrance via the function as in logistic regression) [12,7].We articulate Fisher's definition in the penetrance scale as where f j : R → [0, 1], j ∈ {A, B, AB} are functions that describe the relationship between genes or interactions and penetrance.

S1.1.4 Form of additive and interaction models using training data
While raw gene expression data are often represented as counts (e.g.RNA-Sequencing), commonly available data are typically preprocessed.For example, PrediXcan estimates/imputes inverse normal transformed gene expression rather than raw expression value.In practice, there are a range of standard transforms and pre-processing steps, such that the actual scaling of gene features A and B can be rather arbitrary (see e.g., [2] for some general discussion).These transforms/pre-processing steps are generally not linear, although they are typically monotonic.As a result, we favor models that are invariant to monotone transformations of the data (resulting from potential pre-processing steps).In addition, we want to allow more flexible mappings beyond linear (as in 3) to account for non-linearities that pervade biological systems.There are many different forms of non-linear functional relationships, which may each be useful in describing different epistatic behavior.A particularly flexible class of functions are decision trees, which have the benefit of being both simple to interpret and invariant to monotone feature transformations.In addition, as mentioned previously, it is well known that many biological processes exhibit thresholding dynamics that are mimicked in the Boolean rules used by decision trees [21,16,20,18].Moreover, for the classical epistasis model H 1 in (3) the multiplicative functional form lacks biological justification.Collectively, these considerations motivate our use of decision trees in both the additive, null model H 0 and non-additive, alternative model H 1 .All individual decision trees are obtained via a CART [4] fit on the training data, using backfitting for the additive model.In summary, we translate Fisher's epistasis definition into a (non-parametric) hypothesis inference setting as Decision trees CART A , CART B , CART AB are fit on the training data set (that is, the same data that was used in the iRF candidate selection step) via the CART algorithm (using the R Package rpart).
To fit the additive regression model H 0 in (5), we applied backfitting [13], where one recursively re-fitts additive CART components on the respective residuals with the remaining components.We stopped the recursion when none of the predicted values changed by more than 1% compared to the previous iteration.We further discuss the parameter choice for the CART algorithm in Section S1.4.Finally, we conduct inference using a hold-out test set, following the prediction principle (see next section).

S1.1.5 Predictability principle from the PCS framework
Classical p-values, as obtained via logistic regression-tests for model (3), compare the goodness-of-fit for the null and the alternative hypothesis using all available data.In other words, the null and the alternative models are fit on the same data that are used to evaluate the quality of each fit.In contrast, we learn null and alternative models on one (often randomly sampled) training data set and compare their prediction accuracy on a disjoint (often randomly sampled) set of test or hold-out data.This sample splitting approach is an old idea in statistics and has been used recently to deal with post-selection and so called universal inference problems, see, e.g., [31,17,3,9,32] and is widely used by the machine learning community to guard against overfitting.
Fitted null and alternative models correspond to hypotheses for non-epistasis (null, additive) and epistasis (alternative, non-additive) respectively, which leads to a simple hypothesis testing problem on the hold-out test data (since we fix both, the null and the alternative model from training data).We define predictions from the non-epistasis and epistasis models on the test data with gene expression (or SNP) values (a i , b i ), for i = 1, . . ., n, as: For simplicity, we denote p 0 , p 1 ∈ R n as the n-vectors with components p 0,i := p 0 (a i , b i ) for and p 1,i := In principle, one could apply a Neyman-Pearson test (with likelihood ratio test statistic) in order to obtain a p-value (conditioned on p 0 and p 1 ).However, the models p 0 and p 1 will generally be (at least slightly) mis-specified.For large sample size this can lead to unrealistic extremely small p-values (e.g., in the red hair data example as small as 10 −100 ).This astronomically small p-value is due to the fact that the chi-squared asymptotic distributional approximation to the likelihood ratio test-statistics does not take into account model misspecification, which cannot be ignored for sufficiently large sample sizes (as in the UK biobank data).
In the following we use the PCS inference ideas proposed in [34]  Note that both, p 0 and p 1 , are independent of the observed response Y in the test data used for inference.
Given the observed test data (Y, A, B), we consider a canonical test statistic, the log-likelihood ratio with we conclude that there is not evidence for epistasis and epiTree formally reports a p-value of 1, that is When T (Y ) < 1, i.e. the epistasis model yields an increase in prediction accuracy, we quantify the uncertainty surrounding this improvement by pairing a simulated null data perturbation with bootstrap sub-sampling, as outlined below.
PCS Null perturbation.The PCS framework uses null perturbations to simulate data that respect known structure and compares observed results to results obtained from these simulated data.Recall that p 0 in (6) denotes our estimated penetrance for the test data under the hypothesis of no epistatic interaction.We simulate responses under the no epistasis null hypothesis as which we refer to as the null perturbation.Below, we outline our approach for constructing a reference null distribution of the test statistic.

PCS p-value
To obtain a PCS p-value, we use bootstrap re-sampling over the test data to evaluate where  Table S1: High-level summary of comparison between p-value from logistic regression with polynomial model vs. PCS p-value with CART model for epistasis testing.
In addition to the null responses above, we obtain Y |m ∈ {0, 1} n , an n-vector consisting of the observed responses for the mth bootstrap sample.The PCS p-value is then given by In practice, one can obtain a simple analytic approximation of (11) as outlined in Section S1.1.8.By construction, Y and Y 0 have the same distribution under the null hypothesis, making our proposed PCS inference a valid p-value (conditioned on p 0 ).We summarize the general test procedure epiTree in terms of the computation of the CART based PCS p-values in Figure S1.Table S1 shows a high level comparison of the CART based PCS p-values and p-values obtained from logistic regression.

S1.1.7 Comparison between PCS p-value and classical p-value calculations
In the following, we provide insights into the major differences between calculations of a PCS p-value and a classical p-value.For a given test statistic T (Y ), null hypothesis H 0 : Y ∼ F 0 with null distribution F 0 , and alternative hypothesis H 1 where smaller values of T correspond to stronger evidence against the null (as e.g., for T as in (7) for the epiTree test), a classical p-value is given by where t = T (Y ) is the observed test statistic and the probability is taken over the randomness of Y 0 with distribution F 0 .The classical p-value in 12 does not take into account empirical fluctuation of T (Y ) among different subsamples of data -randomness only enters through Y 0 .
In contrast, the PCS p-value explicitly takes the empirical variability of the test data (Y, A, B) into account through bootstrap sampling, as in (11).This is necessary because sample splitting leads to a particular test set.Bootstrap samples of the test set mimic the other possible test sets that could have been resulted from a different sample split.More precisely, let Fn denote the empirical distribution of the test data.As the number of bootstrap samples M in (11) tends to infinity, Thus, asymptotically, when the bootstrap distribution (empirical distribution) converges to the true underlying distribution of Y , denoted as F , the PCS p-value is equivalent to where Y and Y 0 are independent.For the given observed value t, as in (12), we can rewrite ( 14) to obtain where t is fixed and the probability is taken as in ( 14) jointly over (Y 0 , Y ) with distribution F 0 × F .Note that the difference between ( 12) and ( 15) . This means that the effective null distribution of the PCS p-value corresponds to a convolution of the proposed null distribution, T (Y 0 ), and a centered version of the observed distribution T (Y ).In particular, (15) does not just take into account the distribution of the probabilistic null model F 0 , but also the observed sample distribution of Y .Hence the PCS p-value is more stable with respect to the observed data and possibly also model mis-specifications.This centered, convoluted null distribution also prevents artificially small p-values that are often obtained in the classical setting (12) with large n and slightly missspecified null distributions.
For the CART based PCS p-value, as in the epiTree pipeline, the test statistic T (Y ) as in (7) corresponds to the log-likelihood ratio statistic for the simple null hypothesis y i ∼ Bernoulli(p 0,i ) against the simple alternative y i ∼ Bernoulli(p 1,i ).A classical p-value in this setting corresponds to a Neyman-Pearson test for these simple null and alternative hypothesis.In particular, for the classical Neyman-Pearson p-value the variance of the test statistic T (Y ) under the null distribution H 0 : Y ∼ F 0 is completely independent of the actual observed responses Y .In contrast, the PCS p-value considers the inflated null distribution of T (Y 0 ) − T (Y ), with Y 0 ∼ F 0 and Y ∼ Fn .Hence, the PCS p-value does not just incorporate the distributional variance of F 0 but also the observed empirical variance of the response Y .We stress that the general concept of the PCS p-value does not depend on the particular choice of statistic T (Y ) in (7).As a result, we can calculate PCS p-values not just for the CART based model that we employed for the epiTree pipeline, but also for any other hypothesis testing problem, with an arbitrary test statistic T (Y ) and null distribution F 0 .
In the supplemental Section S1.6, we provide a detailed toy example of a simple linear regression model without intercept, where analytic forms of a classical p-value, as in (12), and PCS p-value, as in ( 14), can easily be derived analytically.Although this setup is overly simplistic, it provides concrete analytical insights into situations when a PCS p-value is beneficial compared with the a classical p-value.
As shown in Section S1.6, the classical p-value has higher power when the data are exactly generated from the hypothetical alternative distribution H 1 .In this case, the PCS p-value's lower detection power originates from the additional uncertainty quantification in the bootstrap sub-sampling -PCS p-value explicitly takes the empirical variation of the observed data sample into account.However, in arguably most real data applications (and certainly for epistasis testing, as stressed throughout this paper), the data generating process cannot be specified exactly for either the null or the alternative model.In such settings, the PCS p-value can be more robust towards such misspecification.Specifically, we show (both, analytically and in simulations) that classical p-values can result in severe false positive rates when the data are generated from a slight variation of the hypothesised null distribution.On the other hand, the PCS p-value is generally more robust and, in contrast to the classical p-value, does not result in an increased type 1 error.Moreover, we also provide an explicit example in Section S1.6 where the data are generated from a slight variation of the hypothesised alternative distribution.In this case, we observe that the PCS p-value can have a higher detection power compared to the classical p-value.This originates form the fact that the PCS p-value explores the full empirical distribution of the observed responses, and is therefore able to detect deviations from the hypothesized null distribution of the statistic T (Y ), which typically cannot be detected from the a single observed t, as in the classical p-value.In summary, while the standard p-values work well in settings where everything is specified correctly, when models are misspecified PCS p-values can be favorable -they provide more stable type 1 error control and can have a higher detection power in the presence of outliers.

S1.1.8 Analytic approximation of PCS p-value in the epiTree test
In order to obtain an analytic approximation of (11) note that where p 1,i and p 0,i denote the ith component of the vectors p 1 and p 0 , respectively, and y 0,i denotes the ith component of the n-vector Y 0 , and For each bootstrap sample m, we draw n i.i.d.indexes I(m) 1 , . . ., I(m) n uniformly from the set {1, . . ., n}.
Thus we can write Conditioned on the data (with the only randomness coming from the bootstrap sampling via the random indexes I(m) 1 , . . ., I(m) n ), the M different terms in equation ( 11) are independent and identically distributed Bernoulli random variables.Hence, by the law of large numbers and using equation ( 17), we get that for infinitely many bootstrap samples (M → ∞) PCS p-value = P (I1,...,In) where randomness is over the n i.i.d.random indexes I 1 , . . ., I n drawn uniformly at random from the set {1, . . ., n}.Again, conditioned on the data, we have that the n random variables δ I1 , . . ., δ In are independent and identically distributed, each following a uniform distribution on the set {δ 1 , . . ., δ n }.
Thus, they have mean and variance where E I1 (•) mean that we take the expectation w.r.t. the random bootstrap sample.Hence, it follows from the central limit theorem that X := √ n( 1 n n i=1 δ Ii − µ)/σ converges in distribution to a standard Gaussian as n → ∞ (in our case n = 4K).Thus, with where Φ denotes the cumulative distribution function of the standard normal distribution.Note that it follows from ( 21) that PCS p-value < 0.5 if any only if µ < 0, which is equivalent to T (Y ) < T (Y 0 ).Thus the PCS p-value can only be significant when the improvement in prediction (of the epistasis model over the non-epistasis model) for the observed data are greater than for the null perturbation.Further, note one can upper bound the rate of convergence in (21) using the nonuniform Berry-Esseen theorem: and Then there exists some constant L such that for all z ≥ 0 Therefore, from (18) and Theorem 1 we get that when µ < 0 with . Thus, the approximation error from the CLT for the PCS p-value decreases of order n −2 .In our case n = 4, 000, i.e., n −2 is of order 10 −7 .We note, however, that one also has to take the exact value of C into account (which depends on the data).
Averaging over null perturbation For the PCS p-value approximation above, we conditioned on both the observed responses Y as well as the (simulated) null perturbation Y 0 (see definition of δ i in ( 16)).In practice, the reported PCS p-value should be independent of any randomness introduced by Y 0 .Below, we detail two different ways to achieve this: 1. Average the PCS p-value over infinitely many random realizations of Y 0 .In other words, when computing the expectation and standard deviation of δ Ii in equation ( 19) and (20), we average over the randomness of Y 0 , which is equivalent to replacing the y 0,i term with its expectation p 0,i .
This gives and with where E I1,Y0 (•) means that we take the expectation w.r.t. the random bootstrap index I 1 and the random null perturbation Y 0 .The PCS p-value approximation is then given by Φ( √ nµ/σ).

Alternatively
. This is equivalent to replacing δ i by δ i from ( 24) when computing the mean µ and standard deviation σ in (19) and (20).Note that the only different between δ i and δ i is that the random quantity y 0,i gets replaced by its expectation p 0,i .In particular, conditioned on the observed data (Y, A, B), the quantity δ i is random, as it depends on the random null perturbation y 0,i , and δ i is not random.
Note that the only difference between the two different approaches is in the variance term σ 2 , where the former approach results in a slightly larger variance via the additional For the red hair analysis, we found that both approaches give the same magnitude for the p-values.The p-values presented in the results section for the red hair analysis correspond to the latter approximation.

S1.1.9 Higher order interactions
In the following, we discuss how to extend inference from pairwise to higher-order interactions.Inference for an interaction in the logistic regression model is based on the respective coefficients in a polynomial interaction model.Thus, for higher-order interactions one typically adds higher-order polynomial terms to the model.For example, to test an order-three interaction among genes A, B, C, the logistic regression hypothesis problem in equation ( 3) translates to If we replace the polynomial terms in (25) with CART interaction terms and replace the logistic scale by the penetrance scale, we obtain Note, however, that in (27), without loss of generality, the lower order CART terms can be incorporated into the respective higher-order CART terms 1 .Hence, for the CART model the direct comparison to (25) is to rewrite equation ( 5) as More generally, the null model H 0 to test for an order-d interaction can be written as an additive model of d CARTs each taking a subset of size d − 1 and the alternative epistasis model can be written as a single CART taking all the d features as input.
From a biological perspective, it is not clear whether H 0 in (28) -where each of the genes interact with every other gene -should be interpreted as no d th order epistasis.Across the subcellular, cellular, tissue and organismal levels, biological processes are often found to be subdivided into multiple component proteins, such as an enzyme complex or signaling cascade [25].Considering the genetic basis of human phenotypes, the linkage of multiple genetic variants spanning protein subunits but influencing a specific pathway is an increasingly common finding [8,28,14].Modeling the relationship between all d components in H 0 of (28) yields the opportunity to more directly capture the redundancy and non-linear behavior of biological pathways and protein signaling which characterizes observations of epistasis in complex organisms [27,29].To illustrate this, consider the relationship between the different features as a graph, where the nodes of the graph correspond to the individual features and edges between features correspond to pairwise interactions.For the null model (no d th order epistasis) H 0 in (28), each pair of features is connected via an edge in the graph, as illustrated in Figure S2.From the perspective of a biological pathways, S2 implies that all genes belong to a shared pathway, which arguably represents a d th order biological interaction one may be interested in detecting.
By framing inference w.r.t. to prediction, the epiTree p-values have greater flexibility derived from learning null and alternative models on separate training data.In particular, for the epiTree p-value 1 To see this, note that any additive CART model of the form CART A (a) + CART AB (a, b) can be rewritten as a single CART term CART AB (a, b), where CART AB (a, b) is the same as CART AB (a, b) but with additional splits at the tip nodes that correspond to the splits of CART A (a).The same holds true for higher order interaction CART models.null model, instead of considering a null model as in (28) and Figure S2, we can learn a partition of the tested interaction features, such that features between different groups do not interact.More precisely, we consider all partitions of the features into two disjoint sets (corresponding to two unconnected graphs), such that features within each set can interact, but features between the two sets can not.The final null model H 0 is then selected via a prediction screening, i.e., the partition with the highest prediction accuracy on the test data is selected as the null model H 0 which is tested against the interaction model H 1 of d-th order epistasis.More precisely, we consider the CART null and alternative model to test a 3rd-order interaction as follows We illustrate this case of 3rd-order epistasis null model H 0 in Figure S3 and 4th-order epistasis null model H 0 in Table S2.Similarly, for a dth-order interaction with features A 1 , . . ., A d , we consider where I 1 and I 2 correspond to the partition which yield the highest prediction accuracy on the test data. 2  Finally, note that in the case of pairwise interactions (i.e. S = {A, B}) the definitions ( 28) and ( 29) are the same.
The CART algorithm, which is applied to fit the tree components of H 0 and H 1 in (5) has several tuning parameters, such as the complexity parameter and the minimal number of observations required at each leave node, which control how deep the tree is grown.In general, the PCS p-value depends on the particular trees which are tested in ( 5) and thus, depends on the CART tuning parameters.Our aim was to select the trees in ( 5) as simple as possible, but deep enough such that they capture the potential interaction behaviour of the considered genes.To this end, we controlled the depth of the tree using the complexity parameter (cp in rpart), which controls the minimum improvement in the model needed at each node.The larger cp is, the shallower is the fitted tree.If cp is very large, the resulting interaction tree CART AB in ( 5) might split only on one of the two genes A or B. In that case, it is clearly not possible for CART AB to capture interaction behaviour of A and B. Analog holds for higher-order interactions (recall Section S1.1.9),if the resulting tree does not contain all genes of that interaction.Therefore, we selected the complexity parameter (cp in rpart) for the interaction model CART AB adaptively, namely, just small enough such that the interaction model included all genes (but not smaller than 0.01, which is the default parameter in the rpart implementation).More precisely, we start with cp = 0.01 and then, as long as not all features get split on in the tree, we replace the current cp value by cp /1.1.For the additive components, CART A and CART B , the complexity parameter is chosen to be the same as for the interaction component.In this way it is guaranteed that the tree for the interaction model, CART AB , is sufficiently deep to capture potential interaction behavior.Moreover, we prevent overfitting by not making cp smaller than necessary for all genes to appear in the tree and generally not smaller than 0.01.At the same time, this guarantees that the complexity of all trees, CART AB , CART A , CART B , is comparable.
We analyzed the stability of the PCS p-value with respect to this choice of tuning parameter in the CART algorithm.This is illustrated for the candidate interaction DBNDD1 and ASIP in Figure S4.
The top plot shows on the x-axis the complexity parameter of the CART algorithm which was used to fit the models in ( 5) and on the y-axis it shows the respective p-value.When the complexity parameter is too large (left side of the plot), the resulting interaction tree does not contain any splits on ASIP (as illustrated in Figure S4) and thus, cannot capture interaction behaviour between ASIP and DBNDD1.
The dashed line corresponds to the largest complexity parameter, such that the tree splits on both DBNDD1 and ASIP.This is the tree which we considered as interaction model for the PCS p-value (see the dashed line in Figure S4 and the tree marked with a red arrow), leading to a highly significant PCS p-value in this case.As the complexity parameter decreases, the tree is grown deeper (as illustrated in Figure S4) and the p-values changes, as can be seen in the jumps in Figure S4).However, we note that although the trees change, the p-values remains significant for a wide range of smaller complexity parameters.Figure S4 shows the same plots for all other candidate interactions which were detected by the iRF algorithm (recall Figure 5).In general, we observe that when the PCS p-value is significant, then it also remains significant over a wide range of complexity parameters.Figure S6 summarizes this observations, where the x-axis shows the PCS p-value of the different candidate interactions and the yaxis shows the fraction of complexity parameters3 where the resulting p-value is smaller than 0.01.This implies that significant PCS p-values appear to be robust to reasonable changes of tuning parameters of the CART algorithm.On the other hand, we note that for those interactions where the PCS p-value was not significant, there are still some interactions where smaller complexity parameters consistently result in significant p-values.In practice, one might consider these candidate interactions for further investigation, but we stress that such a screening through complexity parameters does not account for multiple testing.

S1.5 Details on parameter choices for iRF, ranger, and penalized logistic regression
In the following, we provide further specific details on the choice of implementation and software parameters made in our analysis of the red hair phenotype for UK Biobank data with the epiTree pipeline.

Penalized logistic regression:
The penalized logistic regression model was fit with the R package glmnet, using default parameters for classification.The tuning parameter lambda was selected via cross validation using the cv.glmnet function of the glmnet R package.
Random forest: The RF model was implemented in the R package ranger with default parameters for classification.We note that our iRF model was also fit using ranger.As a result, differences between iRF and RF can be directly attributed to iterative feature re-weighting, which is a form of "soft" regularization (see more details below).
iRF: For the iRF model, for k = 10, 50, 100, 500, 1000, 2500, we ran the first iteration with default parameters for classification.We then applied hard thresholding on the top k features, with k selected by minimizing out-of-bag error, followed by three iterations of iterative re-weighting (i.e.soft-thresholding).
For the gene level analysis k = 50, while the subsequent SNP level analysis used k = 1, 000.After the final iteration, we then searched for interactions using RIT, where we grew many shallow trees, in order to obtain a large set of candidate interactions (ntree = 5, 000, depth = 3, child = 5).To evaluate stability of candidate interactions, we performed n.bootstrap = 50 bootstrap replicates and only included candidate interactions that were consistently filtered in at least 50% of the replicates.

Not all features in interaction tree!
Shallowest tree such that all features appear in tree : considered for PCS p-values!Left: a tree for a large cp value, which only splits on DBNDD1 but not on ASIP and thus cannot capture interaction behaviour; middle: the shallowest tree such that both, DBNDD1 and ASIP, appear in the tree, this tree is considered for the PCS p-value; right: a deeper tree which corresponds to a smaller cp value.

S1.6 Illustrative example of PCS p-value for simple linear regression with one predictor
Here we compare classical and PCS p-values using a toy example: a linear model with a single feature or predictor variable, given unit variance, and no intercept.In this simple setting, one can easily derive closed form solutions for the classical p-value and the PCS p-value.We find that in situations where the data is generated exactly from the hypothesized alternative model, the PCS p-value has reasonable detection power, but the classical p-value is superior with fewer false negatives than the PCS p-value.
However, when models are misspecified, the PCS p-value is more robust, leading to fewer false positives under a misspecified null model and fewer false negatives under a misspecified alternative model.In most practical situations the hypothesised models are misspecified to some extent, especially in the big data era.This suggests that the PCS p-value is favorable for most real data applications.
We consider a test data set {(y 1 , x 1 ), . . ., (y n , x n )} with response values y i ∈ R and a single, fixed covariate x i ∈ R, along with a separate training data set {(ỹ 1 , x1 ), . . ., (ỹ n , xn )}, where we have assumed for simplicity that the training and test data are of the same size.Further, to ease notation, in the following we assume that Our goal is to obtain and compare classical and PCS p-values for the hypothesis testing problem ∼ N (0, 1) vs. H 1 : y i ind.
Test statistic: We consider the test statistic as in a classical z-test, that is with Y = (y 1 , . . ., y n ) ⊤ .The classical p-value from the z-test is given by where, again, Φ denotes the cumulative distribution function of the standard normal distribution.In For the null hypothesis H 0 there are no unknown parameters that need to be estimated.

PCS Prediction screening:
In the prediction screening step, we use the test data to evaluate the prediction error from the null model and the alternative model, where here we estimate the alternative model using the training data as in (33).For the loss function, we consider the negative log-likelihood, which equals the squared loss in the Gaussian case (30).More precisely, the prediction screening as in (8) yields PCS Null perturbation: When the prediction screening does not result in a failure to reject the null hypothesis, we simulate from the null perturbation.In this simple setting, the null hypothesis corresponds to i.i.d.Gaussian white noise.Therefore, we perturb our true responses y i as follows: where I n×n denotes the n × n identity matrix.

PCS p-value:
We obtain a PCS p-value with test statistic T (Y ) as in (31) and null perturbation Y 0 as in (35) via bootstrap samples as in (11).Note that in this case we can approximate the PCS p-value as in ( 14) such that PCS p-value ≈ P n i=1 x i (y i − y 0,i ) < 0 . (36)

S1.6.1 Simulation studies
In the following simulations, we compare the classical p-value as in (32)   control over type I error.Of course, we note that in real data settings the "true" model is rarely known and thus this performance is not expected in practice.
First, we consider the setting where the responses are generated from the H 0 distribution as in (30) with y i ∼ N (0, 1) independent for all i = 1, . . ., n.As seen in the left plot in Figure S7, both, the classical p-value and the PCS p-value follow a sub-uniform distribution, indicated by the fact that their cumulative distribution function (cdf) is smaller or equal to the identity line.The PCS p-value is a bit more conservative than the classical p-value under the exact null distribution, indicated by the fact that the cdf of the PCS p-value is smaller or equal to the cdf of the classical p-value.However, for smaller quantiles (that are of dominant interest in practice) the different between the PCS p-value and the classical p-value distributions is almost negligible under the exact null distribution.For example, in our Monte Carlo simulations we found that the PCS p-value is smaller than 0.05 in exactly 5% of cases, which coincides with the classical p-value.
Second, we consider the situation where the responses follow a model as in H 1 in (30) with y i ∼ N (c x i , 1) independent for all i = 1, . . ., n, for some fixed c > 0. The right plot in Figure S7 shows the respective p-value distributions for c = 4.In this case, we find that the classical p-value outperforms the PCS p-value, although the PCS p-value has high power.For example, at a nominal level of α = 0.05 the classical p-value correctly rejects the null hypothesis in 99% of cases, but the PCS p-value rejects in only 64% of cases.Similar, at nominal level α = 0.1 the classical p-value rejects in 99.7% of cases, but the PCS p-value in only 74.9% of cases.Intuitively, the reason for this is that exploring the full distribution of the test statistic T (Y ) via the bootstrap sampling results in an additional variance term, which leads to a weaker detection power when the observed responses exactly follow the specified alternative model where σ denotes the sigmoid function and π A , π I baseline success probabilities for active and inactive samples respectively.We set π A = 0.75 and π I = 0.25 • {proportion of active samples} for all models.
We selected features in active interactions S k using an iterative sampling strategy to prevent high levels of correlation between interacting features.Specifically, the first feature was sampled with uniform probability and subsequent features with probability inversely proportional to their maximum absolute correlation over previously selected features.For all models, we set the size of each interaction component to |S k | = 3, k = 1, 2 (i.e. an order-3 interaction).
We note that the different response generating models represent a range of interaction forms.The AND model (π (AN D) ) corresponds to a single interaction that is either active or inactive for all observations.The OR model (π (OR) ) corresponds to two distinct interactions that can be viewed as heterogeneous mechanisms governing response behavior.The ADD model (π (ADD) ) corresponds to two interactions terms that do not interact with one another.Finally, the multiplicative model (π (M U LT ) ) corresponds to a multiplactive interactions.
For each model above, we generated a list of candidate interactions using iRF and computed p-values using logistic regression, pRF, and epiTree.To assess the variability in p-values, we repeated the inference step 25 times on new responses generated from the same set of active interactions.Figure S21 reports the results from each response generating model.We report the distribution of − log 10 (p) for the following groups: (i) active interactions -the true response generating interaction (i.e. S = S 1 or S = S 2 ) (ii) active interaction subsets -a strict subset of the true response generating interaction (i.e. S ⊂ S k ) (iii) active features -features used in active interactions but not all part of the same interaction term (i.e. S ⊆ S 1 ∪ S 2 , S ̸ ⊆ S 1 , S ̸ ⊆ S 2 ) (iv) inactive interaction -includes features that do not appear in any interaction term (i.e. S ∩ (S 1 ∪ S 2 ) ̸ = ∅).We note that the resolution of p-values for pRF depends on the number of permutations run, which we set to 1000 in our simulations.For consistent comparison across all methods, we threshold reported p-values below at 10 −4 .Below we summarize several key findings.
• AND/OR models: All methods consistently detect (at significance level 0.05) active interactions.
epiTree and logistic regression also detects subsets of active interactions while pRF generally does not.We find that active interactions fall at the top of the p-value ranked list for epiTree and pRF but not for logistic regression.In other words, the interactions deemed most important (with respect to p-value) by epiTree and pRF included only features that were truly interacting.
• AND model: pRF consistently detects false positives in the AND model (average p-value of inactive interactions ∼ 0.025).We find that this is due to the fact that pRF p-values are generally low for interactions that contain both an active interaction and other non-interacting features.In contrast, epiTree and LR rarely detect these false positive interactions.The average p-values for epiTree and LR are consistently > 0.25.
• ADD model: epiTree is the only method that consistently detects active interactions (and their subsets) in the ADD model.We find that active interactions are at the top of the p-value ranked list for epiTree in the ADD model but not for pRF or logistic regression.
• MULT model: Logistic regression is the only method that consistently detects subsets of multiplicative interactions.We note that no method computes p-values for the active interaction since it was not detected by iRF, and thus not contained in the set of candidate interactions.While epiTree does not detect subsets of the active multiplicative interaction at a significance level of 0.05, we find that such subsets still fall at the top of the p-value ranked list for epiTree.
In summary, epiTree performs comparably to or better than competing methods under the AND, OR, and ADD models.Specifically, epiTree consistently recovers true interactions with generally lower false positive rates than pRF or logistic regression.While logistic regression slightly outperforms epiTree under the multiplicative model, all methods perform poorly in this setting.

S1.8 Further details on the multiple sclerosis case study
For the multiple sclerosis case study we considered a total of 502,369 individuals from the UK Biobank cohort, aged between 40-69 at recruitment (MS status defined using ICD-10 code G35).We searched for gene-level interactions in a balanced, random sample of cases and controls (2083 MS individuals and 2083 controls).To help assess the generalizability of our results to unobserved data, we performed a random sample split into training and test sets with 3333 training samples and 833 test samples.We modeled MS using ∼62,000,000 common variants (MAF > 0.01) imputed to the Haplotype Reference Consortium (HRC) and UK10K reference panels from ∼800,000 directly genotyped variants, which were obtained by UK Biobank using one of two similar arrays [5].Details regarding the ascertainment and quality control of these genotypes have been previously described [5].We estimated gene expression levels in brain cortex tissue from the individual SNP data using CTIMP [1].
Figure S9: Response surface for ASIP -DEF8, otherwise as Figure 7 S2 Supplemental Figures         Inference was conducted on interactions recovered by iRF.Note: iRF did not detect the full active interaction for the multiplicative model.Candidate interactions are grouped by whether they correspond to (i) a true interaction (ii) a subset of the true interaction (iii) a non-active interaction that includes only active features (iv) a non-active interaction that includes inactive features.Error bars show 1 standard deviation.The grey dashed line corresponds to a significance threshold of 0.05.
bootstrap indicates randomness relative to bootstrap sampling of the test data.More precisely, we generate bootstrap samples m = 1, . . ., M , each of size n (= sample size of the test data) drawn i.i.d. with replacement from the empirical distribution of our observed test data.For each bootstrap sample we obtain p 0 |m ∈ [0, 1] n , a vector of estimated penetrance under the null.Using p 0 |m, we simulate null responses for bootstrap sample m under the null perturbation (9) as Y 0 |m ∼ Bernoulli(p 0 |m).
, for a given bootstrap sample m in (17), instead of comparing T (Y |m) to T (Y 0 |m) for some particular random realization of Y 0 |m, one can compare T (Y |m) to the average realization of

Figure S4 :
FigureS4: Obtained p-value (y-axis) for different choices of the complexity parameter cp (x-axis) in the CART implementation of rpart for the candidate interaction between ASIP and DBNDD1.The dashed line corresponds to the cp value which was considered for the PCS p-value and the gray highlighted area corresponds to those cp values where the resulting CART interaction model of ASIP and DBNDD1 does only split on DBNDD1 but not on ASIP.Three different trees, which correspond to the interaction model of some cp values are shown.Left: a tree for a large cp value, which only splits on DBNDD1 but not on ASIP and thus cannot capture interaction behaviour; middle: the shallowest tree such that both, DBNDD1 and ASIP, appear in the tree, this tree is considered for the PCS p-value; right: a deeper tree which corresponds to a smaller cp value.

Figure S5 :
Figure S5: Obtained p-value (y-axis) for different choices of the complexity parameter cp (x-axis) in the CART implementation of rpart, analog as in Figure S4.

Figure S6 :
FigureS6: The x-axis shows the PCS p-value of the candidate interactions which were detected by the iRF screening procedure, as in Figure5.The y-axis shows the fraction of complexity parameters where the resulting p-value is smaller than 0.01.For this fraction we considered all complexity parameters cp on a log-scale up to one unit smaller than considered in the PCS p-value, that is, the range of cp values shown right of the dashed lines in FigureS5.
the following examples we compare this classical p-value to the respective PCS p-value.For the PCS p-value we first obtain an estimate for both, the hypothesis model and the alternative model, from the training data.For the alternative H 1 we obtain an estimate for the coefficient c from the training data {(ỹ i , xi ), i = 1, . . ., n} as ĉ = n i=1 xi ỹi .

Figure S7 :
Figure S7: Distribution of classical p-value and PCS p-value for the hypothesis testing problem in (30) with test statistic as in (31) for n = 1, 000 and a single dependent variables x i drawn independently from a uniform distribution on [0, 1].The black solid line corresponds to the PCS p-value, the black dotted lines to the classical p-value, and the gray solid line to the identity.Results are obtained from 1, 000 Monte Carlo runs.Left: responses y i as in H 0 in (30); Right: responses y i as in H 1 in (30) with c = 4.

Figure S14 :
Figure S14: Decision trees for XPOTP1 -PKHD1 interaction.The decimal digits at the tip nodes correspond to the predicted probability of red hair.The percentage at the tip nodes corresponds to the percentage of training observations falling into this tip node.

Figure S15 :
FigureS15: List of stable gene level interactions found by iRF (stability score > 0.5).The first column shows the PCS p-value (orange) and p-value from logistic regression (gray) on a -log10 scale.The second column shows the prediction error (cross-entropy) on the test data of the learned CART (orange) and logistic regression (gray) models for both, no-epistasis (light color) and epistasis (dark color).The black vertical line shows the prediction error achieved by iRF using all the gene features simultaneously.

Figure S16 :
Figure S16: Same as Figure S15, but for the variant level.

Figure S17 :
Figure S17: Same as Figure S15, but for the top 20 PCS p-values among inter chromosome pairwise brute force search of top 50 iRF genes (in terms of Gini importance).

Figure S18 :
Figure S18: Same as Figure S17 but restricted those interactions which are not between chromosome 16 and chromosome 20 genes.

Figure S19 :
Figure S19: Same as Figure S17 but for top p-values from logistic regression.

Figure S20 :Figure S21 :
Figure S20: Same as Figure S18 but for top p-values from logistic regression.

Figure S22 :
Figure S22: Location of the coding region for the 16 genes which appear in the stable gene level interactions found by iRF that contain only genes on chromosome 16, together with the location of the MC1R gene.

Figure S23 :
FigureS23: Number of interactions per interaction-order discovered by iRF within individual screening steps (red hair case study).The left plot corresponds to the gene-level analysis and the right plot to the SNP-level analysis.The x-axis shows the order of the interactions and the y-axis the number of discovered interactions for a specific order.Red: Interactions from the initial iRF screening.Green: From those, interactions which are stable among bootstrap replicates.Blue: From those, interactions which are not in high linkage disequilibrium.
(a) Results for analysis restricted to female subjects.(b)Results for analysis restricted to male subjects.

Figure S24 :
Figure S24: Analog as Figure 3 A., but for the iRF prediction model and competitors trained on the data from the multiple sclerosis case study in Section S1.8.

( a )
Results for analysis restricted to female subjects.(b) Results for analysis restricted to male subjects.

Figure S25 :
Figure S25: List of stable gene level interactions found by iRF (stability score > 0.5) in the multiple sclerosis case study from Section S1.8, together with PCS p-value.Only interactions with PCS p-value < 0.1 are shown.
1−yi, where p • can be p 0 or p 1 , and we have dropped dependence on A, B in our notation for simplicity.For given labels Y , T (Y ) measures how much more likely observed labels are under the null (p 0 ) compared to the alternative (p 1 ).Smaller values of T (Y ) correspond to greater evidence for the epistasis model.PCS Prediction screening.Note that both, p 0 and p 1 , were obtained from training data only.
and the PCS p-value as in (36) under different probabilistic generating models for the responses y i (and ỹi , respectively).We note that the PCS p-value requires a separate training data set, which is not required by the classical p-value.In order to provide a fair comparison in all of our simulations studies , we use all available data as test data for the classical p-value and for the PCS p-value we randomly split all available data into a training and a test data sets of equal size.Below, we summarize our results.More details, Classical setting -correctly specified model.Here we compare PCS and classical p-values under the correctly specified model where data are generated exactly as indicated by (i) the null hypothesis H 0 and (ii) the alternative hypothesis H 1 .In these settings, the classical p-value outperforms the PCS p-value.That is, the classical p-value has higher power under the alternative and maintains the correct